Training Better CNNs Requires to Rethink ReLU
Authors
Abstract
With the rapid development of Deep Convolutional Neural Networks (DCNNs), numerous works focus on designing better network architectures (e.g., AlexNet, VGG, Inception, ResNet, and DenseNet). Nevertheless, all these networks share the same characteristic: each convolutional layer is followed by an activation layer, most commonly a Rectified Linear Unit (ReLU). In this work, we argue that the paired module with a 1:1 convolution-to-ReLU ratio is not the best choice, since it may result in poor generalization ability. We therefore investigate which convolution-to-ReLU ratio is more suitable for building better network architectures. Specifically, inspired by Leaky ReLU, we adopt a proportional module with an N:M (N>M) convolution-to-ReLU ratio to design better networks. From the perspective of ensemble learning, Leaky ReLU can be viewed as an ensemble of networks with different convolution-to-ReLU ratios. Through the analysis of a simple Leaky ReLU model, we find that the proportional module with an N:M (N>M) ratio helps networks achieve better performance. By utilizing this proportional module, many popular networks can form richer representations, since the N:M (N>M) module uses information more effectively. Furthermore, we apply this module in diverse DCNN models to explore whether the N:M (N>M) convolution-to-ReLU ratio is indeed more effective. Our experimental results show that this simple yet effective method achieves better performance on different benchmarks with various network architectures, verifying the superiority of the proportional module. In addition, to our knowledge, this is the first time a proportional module has been introduced into DCNN models. We believe the proposed method can help researchers design better network architectures.

Introduction

Nowadays, with the availability of large-scale image datasets (e.g., ImageNet (Russakovsky et al. 2015)) as well as high-performance computing resources such as GPUs, deep Convolutional Neural Networks (CNNs) (LeCun et al. 1998) have become dominant in many computer vision applications, especially image classification (Krizhevsky, Sutskever, and Hinton 2012).
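The proportional module described in the abstract can be pictured as a block in which several convolutions share a single activation. Below is a minimal PyTorch sketch of a 2:1 (N=2, M=1) convolution-to-ReLU block; it is an illustration of the idea, not the authors' implementation, and the 3x3 kernels and batch normalization are assumptions made for concreteness.

import torch
import torch.nn as nn


class ProportionalBlock(nn.Module):
    """Illustrative 2:1 convolution-to-ReLU block (N=2, M=1).

    Two stacked convolutions are followed by a single ReLU instead of the
    usual conv-ReLU-conv-ReLU pairing. Kernel size and batch normalization
    are assumptions of this sketch, not details taken from the paper.
    """

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The first convolution is left linear (no activation) ...
        x = self.bn1(self.conv1(x))
        # ... and only the second convolution is followed by a ReLU,
        # giving the 2:1 convolution-to-ReLU ratio.
        return self.relu(self.bn2(self.conv2(x)))


if __name__ == "__main__":
    block = ProportionalBlock(3, 16)
    print(block(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])

Stacking blocks of this kind in place of the usual conv-ReLU pairs is the sort of architectural change the abstract argues for.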
Similar Papers
Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels
In this paper, we consider parameter recovery for non-overlapping convolutional neural networks (CNNs) with multiple kernels. We show that when the inputs follow a Gaussian distribution and the sample size is sufficiently large, the squared loss of such CNNs is locally strongly convex in a basin of attraction near the global optima for most popular activation functions, like ReLU, Leaky ReLU, Squ...
Phone recognition with hierarchical convolutional deep maxout networks
Deep convolutional neural networks (CNNs) have recently been shown to outperform fully connected deep neural networks (DNNs) both on low-resource and on large-scale speech tasks. Experiments indicate that convolutional networks can attain a 10–15 % relative improvement in the word error rate of large vocabulary recognition tasks over fully connected deep networks. Here, we explore some refineme...
Investigation of parametric rectified linear units for noise robust speech recognition
Convolutional neural networks with rectified linear units (ReLU) have been successful in speech recognition and computer vision tasks. ReLU was proposed as a better match to biological neural activation functions than the sigmoidal non-linearity. However, ReLU has the disadvantage that the gradient is zero whenever the unit is not active or saturated. To alleviate the potential problem...
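For context, the parametric rectified linear unit studied there keeps a small learned slope on the negative side, so the gradient does not vanish for inactive units. A minimal sketch of that behavior (PyTorch also provides it directly as nn.PReLU):

import torch
import torch.nn as nn


class SimplePReLU(nn.Module):
    """f(x) = x for x > 0 and a * x otherwise, with a learned slope a."""

    def __init__(self, init_slope=0.25):
        super().__init__()
        # A single learnable negative-side slope, shared across all inputs.
        self.slope = nn.Parameter(torch.tensor(init_slope))

    def forward(self, x):
        # Unlike plain ReLU, the negative branch has gradient `slope`
        # rather than zero, so inactive units can still receive updates.
        return torch.where(x > 0, x, self.slope * x)


x = torch.linspace(-2.0, 2.0, steps=5)
print(SimplePReLU()(x))  # tensor([-0.5000, -0.2500, 0.0000, 1.0000, 2.0000], ...)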
CNNs are Globally Optimal Given Multi-Layer Support
Stochastic Gradient Descent (SGD) is the central workhorse for training modern CNNs. Although it gives impressive empirical performance, it can be slow to converge. In this paper we explore a novel strategy for training a CNN using an alternation strategy that offers substantial speedups during training. We make the following contributions: (i) replace the ReLU non-linearity within a CNN with posi...
EraseReLU: A Simple Way to Ease the Training of Deep Convolution Neural Networks
For most state-of-the-art architectures, the Rectified Linear Unit (ReLU) has become a standard component accompanying each layer. Although ReLU can ease network training to an extent, its characteristic of blocking negative values may suppress the propagation of useful information and lead to difficulty in optimizing very deep Convolutional Neural Networks (CNNs). Moreover, stacking of layers w...
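As the title suggests, the idea behind EraseReLU is to drop the activation after selected layers. One rough way to express that on an existing PyTorch block (an illustration under that assumption, not the paper's code) is to swap a block's final ReLU for an identity mapping:

import torch.nn as nn


def erase_last_relu(block: nn.Sequential) -> nn.Sequential:
    """Replace the last ReLU in a Sequential block with Identity.

    This mirrors the general idea of removing some activations so that
    negative values can propagate; which layers to modify is a choice
    left to the experiment, not prescribed here.
    """
    layers = list(block)
    for i in range(len(layers) - 1, -1, -1):
        if isinstance(layers[i], nn.ReLU):
            layers[i] = nn.Identity()
            break
    return nn.Sequential(*layers)


block = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
print(erase_last_relu(block))  # the second ReLU is now Identity()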
Journal: CoRR
Volume: abs/1709.06247
Issue: -
Pages: -
Publication year: 2017